# A tibble: 3 × 2
name band
<chr> <chr>
1 Mick Stones
2 John Beatles
3 Paul Beatles
# A tibble: 3 × 2
name plays
<chr> <chr>
1 John guitar
2 Paul bass
3 Keith guitar
tidyverse, grouping and formulas
5 Jun 2025
I owe a debt of gratitude to many people, as the thoughts and code in these slides are the product of years-long development cycles and discussions with my team, friends, colleagues and peers. When someone has contributed to the content of the slides, I have credited their authorship.
Images are either directly linked, or generated with StableDiffusion or DALL-E. That said, there is no information in this presentation that exceeds legal use of copyright materials in academic settings, or that should not be part of the public domain.
Warning
You may use any and all content in this presentation - including my name - and submit it as input to generative AI tools, with the following exception:
Materials
Yesterday we covered these topics:
Today we will learn:
With an inner join, we combine two data frames based on a common key. Only the rows with matching keys in both data frames are kept.
With a left join, we keep all rows from the left data frame and only the matching rows from the right data frame. If there is no match, the result will contain NA for the columns from the right data frame.
With a right join, we keep all rows from the right data frame and only the matching rows from the left data frame. If there is no match, the result will contain NA for the columns from the left data frame.
With a full join, we keep all rows from both data frames. If there is no match, the result will contain NA for the columns from the other data frame.
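The four joins can be sketched with two small tibbles that mirror the example tables printed at the top of these slides. The object names `members` and `instruments` are chosen here for illustration only.

```r
library(dplyr)

# Two small tables that share the key column `name`
members <- tibble(
  name = c("Mick", "John", "Paul"),
  band = c("Stones", "Beatles", "Beatles")
)
instruments <- tibble(
  name  = c("John", "Paul", "Keith"),
  plays = c("guitar", "bass", "guitar")
)

# Only names that occur in both tables
inner <- inner_join(members, instruments, by = "name")

# All rows of `members`; `plays` is NA where there is no match
left <- left_join(members, instruments, by = "name")

# All rows of `instruments`; `band` is NA where there is no match
right <- right_join(members, instruments, by = "name")

# All rows of both tables
full <- full_join(members, instruments, by = "name")
```

Mick has no instrument on record and Keith has no band on record, so those two rows are the ones that pick up an NA or drop out, depending on the join.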
tidyverse and the data analysis cycle
Leading principle of the tidyverse: a programming language should really behave like a language.
tidyverse: a few key verbs that perform common types of data manipulation.
The tidyverse packages operate on tidy data:
Each column is a variable
Each row is an observation
Each cell is a single value
Untidy versus tidy data
dplyr package
The dplyr package is a specialized package for working with data.frames (and the related tibble) to transform and summarize tabular data:
dplyr cheatsheet
dplyr?
In combination with the pipe operator %>%:
dplyr functions in this lecture
There are many functions available in dplyr, but we will focus on just the following dplyr functions (verbs):
| dplyr verbs | Description |
|---|---|
| glimpse() | a transposed print of the data that shows all variables |
| select() | selects variables (columns) based on their names |
| filter() | subsets the rows of a data frame based on their values |
| arrange() | re-orders or arranges rows |
| mutate() | adds new variables that are functions of existing variables |
| summarise() | creates a new data frame with statistics of the variables (optionally grouped by other variables) |
| group_by() | allows for group operations in the “split-apply-combine” concept |
Check the dplyr cheat sheet for examples.
dplyr::glimpse()
Similar to str(), but shows more data.
str() shows more detailed information about data structure.
dplyr::glimpse(planets)
Rows: 8
Columns: 4
$ planet_type <fct> Terrestrial planet, Terrestrial planet, Terrestrial planet…
$ diameter <dbl> 0.382, 0.949, 1.000, 0.532, 11.209, 9.449, 4.007, 3.883
$ rotation <dbl> 58.64, -243.02, 1.00, 1.03, 0.41, 0.43, -0.72, 0.67
$ rings <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE
str(planets)
'data.frame': 8 obs. of 4 variables:
$ planet_type: Factor w/ 2 levels "Gas giant","Terrestrial planet": 2 2 2 2 1 1 1 1
$ diameter : num 0.382 0.949 1 0.532 11.209 ...
$ rotation : num 58.64 -243.02 1 1.03 0.41 ...
$ rings : logi FALSE FALSE FALSE FALSE TRUE TRUE ...
dplyr::select()
Select the variables planet_type and diameter from the planets data frame:
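A minimal sketch of selecting columns by name. The planets data frame is rebuilt here from the values printed in these slides, so the example runs on its own; the name `planets_small` is illustrative.

```r
library(dplyr)

# planets data, reconstructed from the output printed in these slides
planets <- data.frame(
  planet_type = factor(rep(c("Terrestrial planet", "Gas giant"), each = 4)),
  diameter    = c(0.382, 0.949, 1.000, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation    = c(58.64, -243.02, 1.00, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings       = rep(c(FALSE, TRUE), each = 4),
  row.names   = c("Mercury", "Venus", "Earth", "Mars",
                  "Jupiter", "Saturn", "Uranus", "Neptune")
)

# Keep only the columns planet_type and diameter
planets_small <- select(planets, planet_type, diameter)
planets_small
```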
dplyr::select()
Select numerical variables with where(is.numeric):
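A short sketch of selecting by column type rather than by name. The planets data frame is rebuilt from the values printed in these slides; `planets_numeric` is an illustrative name.

```r
library(dplyr)

# planets data, reconstructed from the output printed in these slides
planets <- data.frame(
  planet_type = factor(rep(c("Terrestrial planet", "Gas giant"), each = 4)),
  diameter    = c(0.382, 0.949, 1.000, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation    = c(58.64, -243.02, 1.00, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings       = rep(c(FALSE, TRUE), each = 4),
  row.names   = c("Mercury", "Venus", "Earth", "Mars",
                  "Jupiter", "Saturn", "Uranus", "Neptune")
)

# where() applies is.numeric() to every column and keeps those that return TRUE
planets_numeric <- select(planets, where(is.numeric))
names(planets_numeric)
```

Here only diameter and rotation are numeric; planet_type is a factor and rings is logical, so both are dropped.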
dplyr::filter()
Selects subsets of the rows of a data frame based on their values.
Select the planets that have a ring and that are gas giants:
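One way to write this filter, as a sketch on a planets data frame rebuilt from the values printed in these slides. Conditions separated by commas are combined with a logical AND; `ringed_giants` is an illustrative name.

```r
library(dplyr)

# planets data, reconstructed from the output printed in these slides
planets <- data.frame(
  planet_type = factor(rep(c("Terrestrial planet", "Gas giant"), each = 4)),
  diameter    = c(0.382, 0.949, 1.000, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation    = c(58.64, -243.02, 1.00, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings       = rep(c(FALSE, TRUE), each = 4),
  row.names   = c("Mercury", "Venus", "Earth", "Mars",
                  "Jupiter", "Saturn", "Uranus", "Neptune")
)

# Keep rows where rings is TRUE and planet_type equals "Gas giant"
ringed_giants <- filter(planets, rings, planet_type == "Gas giant")
ringed_giants
```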
Shortcut key for %>% in RStudio: Ctrl/Cmd + Shift + M
How %>% works
Passes (transformed) data on to the next operation
Avoids nested code
Avoids creation of intermediate objects
Basic principle of %>% operator
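The principle can be illustrated with a small numeric vector: the pipe hands the result on its left to the function on its right as the first argument. The vector `x` and the chosen functions are illustrative only.

```r
library(dplyr)  # makes the %>% operator available

x <- c(1, 4, 9, 16)

# Nested call: has to be read from the inside out
nested <- round(mean(sqrt(x)), 1)

# Piped call: reads from left to right; each result becomes the
# first argument of the next function
piped <- x %>%
  sqrt() %>%
  mean() %>%
  round(1)

identical(nested, piped)
```

Both versions compute the same value; the piped version needs neither nesting nor intermediate objects.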
dplyr::select() with the pipe operator
dplyr::select() without %>%
planet_type diameter
Mercury Terrestrial planet 0.382
Venus Terrestrial planet 0.949
Earth Terrestrial planet 1.000
Mars Terrestrial planet 0.532
Jupiter Gas giant 11.209
Saturn Gas giant 9.449
Uranus Gas giant 4.007
Neptune Gas giant 3.883
Straightforward and might be more familiar to those used to base R functions.
dplyr::select() with %>%
planet_type diameter
Mercury Terrestrial planet 0.382
Venus Terrestrial planet 0.949
Earth Terrestrial planet 1.000
Mars Terrestrial planet 0.532
Jupiter Gas giant 11.209
Saturn Gas giant 9.449
Uranus Gas giant 4.007
Neptune Gas giant 3.883
More readable and flexible when chaining multiple dplyr functions.
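The two calls behind the outputs above could look like this. It is a sketch on a planets data frame rebuilt from the printed values, and the names `without_pipe` and `with_pipe` are illustrative.

```r
library(dplyr)

# planets data, reconstructed from the output printed in these slides
planets <- data.frame(
  planet_type = factor(rep(c("Terrestrial planet", "Gas giant"), each = 4)),
  diameter    = c(0.382, 0.949, 1.000, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation    = c(58.64, -243.02, 1.00, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings       = rep(c(FALSE, TRUE), each = 4),
  row.names   = c("Mercury", "Venus", "Earth", "Mars",
                  "Jupiter", "Saturn", "Uranus", "Neptune")
)

# Without the pipe: the data frame is the first argument
without_pipe <- select(planets, planet_type, diameter)

# With the pipe: the data frame is passed in from the left
with_pipe <- planets %>% select(planet_type, diameter)

identical(without_pipe, with_pipe)
```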
dplyr::mutate()
dplyr::mutate() adds a new variable to the data frame.
Arguments:
.keep specifies which variables to return, “all”, “used”, “unused”, “none”.
.before or .after determine where the new variables are inserted.
dplyr::mutate()
Example: compute a new variable rotation_diameter = rotation/diameter, add it to the data frame and keep all other variables:
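A call along these lines would produce such a result. It is a sketch on a planets data frame rebuilt from the values printed in these slides; `planets_ratio` is an illustrative name, and `.keep = "all"` is the default, spelled out here for clarity.

```r
library(dplyr)

# planets data, reconstructed from the output printed in these slides
planets <- data.frame(
  planet_type = factor(rep(c("Terrestrial planet", "Gas giant"), each = 4)),
  diameter    = c(0.382, 0.949, 1.000, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation    = c(58.64, -243.02, 1.00, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings       = rep(c(FALSE, TRUE), each = 4),
  row.names   = c("Mercury", "Venus", "Earth", "Mars",
                  "Jupiter", "Saturn", "Uranus", "Neptune")
)

# Add the ratio as a fifth column and keep the four existing ones
planets_ratio <- planets %>%
  mutate(rotation_diameter = rotation / diameter, .keep = "all")

glimpse(planets_ratio)
```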
Rows: 8
Columns: 5
$ planet_type <fct> Terrestrial planet, Terrestrial planet, Terrestrial …
$ diameter <dbl> 0.382, 0.949, 1.000, 0.532, 11.209, 9.449, 4.007, 3.…
$ rotation <dbl> 58.64, -243.02, 1.00, 1.03, 0.41, 0.43, -0.72, 0.67
$ rings <lgl> FALSE, FALSE, FALSE, FALSE, TRUE, TRUE, TRUE, TRUE
$ rotation_diameter <dbl> 153.50785340, -256.08008430, 1.00000000, 1.93609023,…
The pipe operations do not make changes to the original data set, unless you save the results:
Temporary:
Changes saved in new data frame:
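The difference can be sketched as follows, using a planets data frame rebuilt from the values printed in these slides. The mutate() step and the name `planets_new` are illustrative choices, not necessarily the ones used in the original slides.

```r
library(dplyr)

# planets data, reconstructed from the output printed in these slides
planets <- data.frame(
  planet_type = factor(rep(c("Terrestrial planet", "Gas giant"), each = 4)),
  diameter    = c(0.382, 0.949, 1.000, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation    = c(58.64, -243.02, 1.00, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings       = rep(c(FALSE, TRUE), each = 4),
  row.names   = c("Mercury", "Venus", "Earth", "Mars",
                  "Jupiter", "Saturn", "Uranus", "Neptune")
)

# Temporary: the result is printed, but planets itself is unchanged
planets %>% mutate(rotation_diameter = rotation / diameter)
ncol(planets)      # still the original number of columns

# Saved: assign the result to a new object
planets_new <- planets %>% mutate(rotation_diameter = rotation / diameter)
ncol(planets_new)  # one column more than planets
```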
dplyr::arrange()
Order the rows of the planets data set by ascending values of diameter:
Original data set:
planet_type diameter rotation rings
Mercury Terrestrial planet 0.382 58.64 FALSE
Venus Terrestrial planet 0.949 -243.02 FALSE
Earth Terrestrial planet 1.000 1.00 FALSE
Mars Terrestrial planet 0.532 1.03 FALSE
Jupiter Gas giant 11.209 0.41 TRUE
Saturn Gas giant 9.449 0.43 TRUE
Uranus Gas giant 4.007 -0.72 TRUE
Neptune Gas giant 3.883 0.67 TRUE
Ordered data set, based on diameter:
planet_type diameter rotation rings
Mercury Terrestrial planet 0.382 58.64 FALSE
Mars Terrestrial planet 0.532 1.03 FALSE
Venus Terrestrial planet 0.949 -243.02 FALSE
Earth Terrestrial planet 1.000 1.00 FALSE
Neptune Gas giant 3.883 0.67 TRUE
Uranus Gas giant 4.007 -0.72 TRUE
Saturn Gas giant 9.449 0.43 TRUE
Jupiter Gas giant 11.209 0.41 TRUE
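The ordering above could be obtained as follows. It is a sketch on a planets data frame rebuilt from the printed values, with illustrative object names; desc() is included to show the reverse order.

```r
library(dplyr)

# planets data, reconstructed from the output printed in these slides
planets <- data.frame(
  planet_type = factor(rep(c("Terrestrial planet", "Gas giant"), each = 4)),
  diameter    = c(0.382, 0.949, 1.000, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation    = c(58.64, -243.02, 1.00, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings       = rep(c(FALSE, TRUE), each = 4),
  row.names   = c("Mercury", "Venus", "Earth", "Mars",
                  "Jupiter", "Saturn", "Uranus", "Neptune")
)

# Ascending order is the default
planets_asc <- planets %>% arrange(diameter)

# Wrap the variable in desc() for descending order
planets_desc <- planets %>% arrange(desc(diameter))

planets_asc
```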
dplyr
Suppose we want to perform the following transformations:
Order the rows of planets by ascending values of rotation
Select only the planets with diameter > 1
Display the variables planet_type, diameter and rotation
With base R code:
subset(planets[order(planets$rotation), ],
       subset = diameter > 1,
       select = c(planet_type, diameter, rotation))
planet_type diameter rotation
Uranus Gas giant 4.007 -0.72
Jupiter Gas giant 11.209 0.41
Saturn Gas giant 9.449 0.43
Neptune Gas giant 3.883 0.67
With dplyr and the pipe %>% operator
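A dplyr version of the same three steps might read as follows. It is a sketch on a planets data frame rebuilt from the values printed in these slides; `result` is an illustrative name.

```r
library(dplyr)

# planets data, reconstructed from the output printed in these slides
planets <- data.frame(
  planet_type = factor(rep(c("Terrestrial planet", "Gas giant"), each = 4)),
  diameter    = c(0.382, 0.949, 1.000, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation    = c(58.64, -243.02, 1.00, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings       = rep(c(FALSE, TRUE), each = 4),
  row.names   = c("Mercury", "Venus", "Earth", "Mars",
                  "Jupiter", "Saturn", "Uranus", "Neptune")
)

result <- planets %>%
  arrange(rotation) %>%                     # order rows by rotation
  filter(diameter > 1) %>%                  # keep planets larger than Earth
  select(planet_type, diameter, rotation)   # keep three columns

result
```

Each line corresponds to one transformation, in the order in which it is applied, whereas the base R version has to be read from the inside out.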
summarise()
The dplyr function for summarizing data:
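A call of roughly this shape yields the mean and standard deviation of diameter. It is a sketch on a planets data frame rebuilt from the values printed in these slides; `planet_stats` is an illustrative name.

```r
library(dplyr)

# planets data, reconstructed from the output printed in these slides
planets <- data.frame(
  planet_type = factor(rep(c("Terrestrial planet", "Gas giant"), each = 4)),
  diameter    = c(0.382, 0.949, 1.000, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation    = c(58.64, -243.02, 1.00, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings       = rep(c(FALSE, TRUE), each = 4),
  row.names   = c("Mercury", "Venus", "Earth", "Mars",
                  "Jupiter", "Saturn", "Uranus", "Neptune")
)

# Collapse all rows into a single row of summary statistics
planet_stats <- planets %>%
  summarise(
    mean_diameter = mean(diameter),
    sd_diameter   = sd(diameter)
  )

planet_stats
```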
mean_diameter sd_diameter
1 3.926375 4.226738
mean(), median(), sd(), var(), sum() for numeric variables
n(), n_distinct() for counts
(see ?dplyr::select and the cheat sheet)
group_by()
The dplyr function for grouping rows of a data frame is very useful in combination with summarise().
Example: group the planets based on having rings (or not) and compute the mean and the standard deviation for each group.
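This example could be written as follows. It is a sketch on a planets data frame rebuilt from the values printed in these slides; summarising diameter is an assumption here, since the slide does not name the variable, and `by_rings` is an illustrative name.

```r
library(dplyr)

# planets data, reconstructed from the output printed in these slides
planets <- data.frame(
  planet_type = factor(rep(c("Terrestrial planet", "Gas giant"), each = 4)),
  diameter    = c(0.382, 0.949, 1.000, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation    = c(58.64, -243.02, 1.00, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings       = rep(c(FALSE, TRUE), each = 4),
  row.names   = c("Mercury", "Venus", "Earth", "Mars",
                  "Jupiter", "Saturn", "Uranus", "Neptune")
)

# Split by rings, apply mean() and sd() per group, combine into one table
by_rings <- planets %>%
  group_by(rings) %>%
  summarise(
    mean_diameter = mean(diameter),  # assumption: diameter is the variable of interest
    sd_diameter   = sd(diameter)
  )

by_rings
```

The result has one row per group, which is the "split-apply-combine" idea in a single pipeline.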
Missing values in R
Calculations that involve missing values (NAs) return NA in R:
There are two easy ways to perform “listwise deletion”:
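A small base R illustration follows. The slide does not say which two approaches are meant, so na.omit() and complete.cases() are an assumption here; the vector `x` and data frame `df` are hypothetical.

```r
# A hypothetical vector with one missing value
x <- c(2, 4, NA, 8)
mean(x)  # the NA propagates, so the result is NA

# A hypothetical data frame with missing values in different rows
df <- data.frame(a = c(1, NA, 3), b = c("x", "y", NA))

# Assumed option 1: na.omit() drops every row that contains an NA
complete_1 <- na.omit(df)

# Assumed option 2: complete.cases() flags rows without any NA
complete_2 <- df[complete.cases(df), ]
```

In both cases only the first row of `df` remains, because it is the only row without a missing value.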
Missing values in dplyr
No solution for missing values:
mean_variable sd_variable
1 NA NA
Use na.rm = TRUE:
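Both situations can be sketched with a hypothetical data frame `dat` that has one column named `variable` with one missing value; the result names `with_na` and `without_na` are illustrative.

```r
library(dplyr)

# Hypothetical data with one missing value
dat <- data.frame(variable = c(2, 4, NA, 8))

# Without handling the NA, both statistics are NA
with_na <- dat %>%
  summarise(
    mean_variable = mean(variable),
    sd_variable   = sd(variable)
  )

# na.rm = TRUE removes the NA before each statistic is computed
without_na <- dat %>%
  summarise(
    mean_variable = mean(variable, na.rm = TRUE),
    sd_variable   = sd(variable, na.rm = TRUE)
  )

without_na
```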
Code with a single pipe operator on one line and spaces around %>%:
Code with multiple pipe operators on multiple lines:
but definitely NOT:
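The three layouts might look as follows. It is a sketch on a planets data frame rebuilt from the values printed in these slides; the specific calls are illustrative, and the discouraged layout is shown as a comment only.

```r
library(dplyr)

# planets data, reconstructed from the output printed in these slides
planets <- data.frame(
  planet_type = factor(rep(c("Terrestrial planet", "Gas giant"), each = 4)),
  diameter    = c(0.382, 0.949, 1.000, 0.532, 11.209, 9.449, 4.007, 3.883),
  rotation    = c(58.64, -243.02, 1.00, 1.03, 0.41, 0.43, -0.72, 0.67),
  rings       = rep(c(FALSE, TRUE), each = 4),
  row.names   = c("Mercury", "Venus", "Earth", "Mars",
                  "Jupiter", "Saturn", "Uranus", "Neptune")
)

# Single pipe: one line, spaces around %>%
one_line <- planets %>% select(planet_type, diameter)

# Multiple pipes: a new line after each %>%, indented by two spaces
multi_line <- planets %>%
  filter(rings) %>%
  select(planet_type, diameter)

# Discouraged: no spaces and everything crammed onto one line
# planets%>%filter(rings)%>%select(planet_type,diameter)
```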
tidyverse style guide: https://style.tidyverse.org/index.html
Gerko Vink @ Anton de Kom Universiteit, Paramaribo